AMD Flang Weno Optimizations#1374
Conversation
Claude Code ReviewHead SHA: 26e982e Files changed:
Findings1. WENO5 final The PR replaces omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2))
omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2))
omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2))This hardcodes 2. Intermediate The do q = 0, weno_num_stencils
alpha(q) = d_cbL/R_${XYZ}$(q, j)/(beta(q)**2._wp)
end do
omega = alpha/sum(alpha) ! <- still an array operation
do q = 0, weno_num_stencils
alpha(q) = ...remapped formula using omega(q)...
end doThe surrounding 3. WENO3 final The WENO5 path's final |
… into weno-array-unwrap
📝 WalkthroughWalkthroughThis pull request contains two primary modifications: First, the CMakeLists.txt expands compile options for OpenMP GPU offloading with the LLVMFlang compiler, adding 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/simulation/m_weno.fpp (2)
1107-1109: Optional: hoist the shared denominator and consider applying the same unroll to order-3/order-7.Two minor points on the order-5 omega normalization changes at 1107-1109 and 1148-1150:
The expression
(alpha(0) + alpha(1) + alpha(2))is recomputed three times per side. A compiler will likely CSE, but an explicit temporary makes intent clearer and is cheaper to read/verify:- omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2)) - omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2)) - omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2)) + tau = alpha(0) + alpha(1) + alpha(2) ! reuse scratch or add a local + omega(0) = alpha(0)/tau + omega(1) = alpha(1)/tau + omega(2) = alpha(2)/tau(Use a dedicated local, not
tau, if it's still live elsewhere — here it is not after this point in the left block. For the right block at 1148-1150,tauis still reusable bywenozabove, so add a fresh scalar and declare itprivatein the outerGPU_PARALLEL_LOOP.)The same
omega = alpha/sum(alpha)pattern still exists at lines 986, 1015 (order-3), 1304, 1366 (order-7), and inside the order-5mapped_weno/tenobranches (1070, 1094). If the AMD Flang perf regression you targeted is specifically the arraysum()reduction, consider applying the unroll uniformly (or justify in a comment why only the order-5 final normalization was unrolled). As-is, the half-done refactor is a small consistency wart.
965-983: AddGPU_LOOP(parallelism='[seq]')markers to newqloops for consistency with adjacent code.The newly introduced
do q = 0, weno_num_stencilsloops insideGPU_PARALLEL_LOOPkernels (order-3 at lines 965–983 and 996–1012, order-5 at lines 1063–1075 and 1123–1135, order-7 at lines 1264–1276 and 1340–1352) lack explicit$:GPU_LOOP(parallelism='[seq]')annotations, while siblingwenoz/tenoqloops in the same kernels carry these markers (e.g., lines 1080, 1088, 1096, 1137, 1142, 1280, 1292, 1354, 1360). Although inner loops default to sequential execution when unmarked (confirmed across nvfortran, Cray ftn, and AMD Flang), adding explicitseqmarkers maintains code clarity and consistency with established patterns in this function.♻️ Example for order-3 left block
if (wenojs) then + $:GPU_LOOP(parallelism='[seq]') do q = 0, weno_num_stencils alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp) end do else if (mapped_weno) then + $:GPU_LOOP(parallelism='[seq]') do q = 0, weno_num_stencils alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp) end do omega = alpha/sum(alpha) + $:GPU_LOOP(parallelism='[seq]') do q = 0, weno_num_stencils alpha(q) = (d_cbL_${XYZ}$ (q, j)*(1._wp + d_cbL_${XYZ}$ (q, & & j) - 3._wp*omega(q)) + omega(q)**2._wp)*(omega(q)/(d_cbL_${XYZ}$ (q, & & j)**2._wp + omega(q)*(1._wp - 2._wp*d_cbL_${XYZ}$ (q, j)))) end do else if (wenoz) then ! Borges, et al. (2008) tau = abs(beta(1) - beta(0)) do q = 0, weno_num_stencils alpha(q) = d_cbR_${XYZ}$ (q, j)*(1._wp + tau/beta(q)) end doApply the same treatment to the mirrored right-side blocks (lines 996–1012, 1123–1135, 1340–1352) and order-5/7 counterparts (lines 1063–1075, 1264–1276).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: da41e88b-a73a-4de0-abe0-bebeba2682cb
📒 Files selected for processing (2)
CMakeLists.txtsrc/simulation/m_weno.fpp
| target_link_options(${a_target} PRIVATE -fopenmp) | ||
| elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang") | ||
| target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a) | ||
| target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
fc="${FC:-flang}"
if ! command -v "$fc" >/dev/null 2>&1; then
echo "Fortran compiler '$fc' not found; set FC to the LLVMFlang/flang binary to verify these flags." >&2
exit 2
fi
"$fc" --help 2>&1 | grep -E -- \
'-fopenmp-target-fast|-fopenmp-assume-threads-oversubscription|-fopenmp-assume-teams-oversubscription|--offload-arch' \
|| {
echo "One or more LLVMFlang OpenMP offload flags are not advertised by this compiler." >&2
exit 1
}Repository: MFlowCode/MFC
Length of output: 152
🏁 Script executed:
# Check if CMakeLists.txt exists and get file size
wc -l CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 76
🏁 Script executed:
# Read lines 630-650 to see the context around line 639
sed -n '630,650p' CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 1484
🏁 Script executed:
# Search for GPU architecture configuration patterns in CMakeLists.txt
rg -n "offload-arch|AMDGPU|AMD_GPU|GPU_ARCH" CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 352
🏁 Script executed:
# Search for LLVMFlang or flang compiler handling
rg -n "LLVMFlang\|flang" CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 39
🏁 Script executed:
# Search for GPU architecture configuration patterns more broadly
rg -n "GPU|ARCH|offload" CMakeLists.txt | head -30Repository: MFlowCode/MFC
Length of output: 1453
🏁 Script executed:
# Search for MFC_* CMake options
rg -n "^option\|^set.*CACHE" CMakeLists.txt | head -40Repository: MFlowCode/MFC
Length of output: 39
🏁 Script executed:
# Get more context around the LLVMFlang block - lines 625-660
sed -n '625,660p' CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 2309
🏁 Script executed:
# Check if there are any other references to gfx90a or MI250X in the repo
rg -n "gfx90a|MI250X|AMD" --type-add 'cmake:*.cmake' -t cmakeRepository: MFlowCode/MFC
Length of output: 559
🏁 Script executed:
# Search for any existing CMake configuration or detection of AMD GPU architecture
rg -n "AMDGPU\|amdgpu\|ROCM\|rocm\|HIP" CMakeLists.txtRepository: MFlowCode/MFC
Length of output: 39
🏁 Script executed:
# Check git history to see if these lines were modified in this PR
git log -p --follow -S "offload-arch=gfx90a" -- CMakeLists.txt | head -100Repository: MFlowCode/MFC
Length of output: 39
🏁 Script executed:
# Get the git diff to see what actually changed in this PR
git diff HEAD~1 CMakeLists.txt | grep -A 5 -B 5 "LLVMFlang" || echo "No diff found or file unchanged"Repository: MFlowCode/MFC
Length of output: 292
🏁 Script executed:
# Search for any mention of build type (Release/Debug) constraints on GPU flags
rg -n "CMAKE_BUILD_TYPE\|Release\|Debug" CMakeLists.txt | head -20Repository: MFlowCode/MFC
Length of output: 39
Gate the LLVMFlang fast-offload assumptions and GPU arch.
Lines 639–640 hardcode --offload-arch=gfx90a for both compile and link, and line 639 unconditionally enables -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription for all LLVMFlang builds, including Debug and RelDebug. ROCm documents -fopenmp-target-fast as requiring its constraints to be satisfied and implying extra assumptions (e.g., ignoring target env vars, applying -O3 when no -O* flag is set), which can alter debugging and performance-tuning behavior. Additionally, hardcoding --offload-arch=gfx90a limits this path to MI250X-class GPUs; the codebase's own FIXME comment at line 645 (in the GNU section) acknowledges this limitation.
Make the GPU architecture configurable via CMake and gate the fast assumptions behind Release build type and/or an explicit option.
Suggested direction
+option(MFC_AMD_OPENMP_FAST "Enable LLVMFlang AMD OpenMP fast target assumptions" ON)
+set(MFC_AMDGPU_ARCH "native" CACHE STRING "AMD GPU architecture for LLVMFlang OpenMP offload")
elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang")
- target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription)
- target_link_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a)
+ target_compile_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}")
+ target_link_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}")
+
+ if (MFC_AMD_OPENMP_FAST AND CMAKE_BUILD_TYPE STREQUAL "Release")
+ target_compile_options(${a_target} PRIVATE
+ -fopenmp-target-fast
+ -fopenmp-assume-threads-oversubscription
+ -fopenmp-assume-teams-oversubscription)
+ endif()
endif()
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1374 +/- ##
==========================================
+ Coverage 64.62% 64.66% +0.04%
==========================================
Files 71 71
Lines 18407 18432 +25
Branches 1516 1517 +1
==========================================
+ Hits 11895 11919 +24
Misses 5555 5555
- Partials 957 958 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: Daniel Vickers <danieljvickers@login09.frontier.olcf.ornl.gov> Co-authored-by: Daniel Vickers <danieljvickers@frontier00007.frontier.olcf.ornl.gov> Co-authored-by: Daniel Vickers <danieljvickers@login12.frontier.olcf.ornl.gov> Co-authored-by: wilfonba <bwilfong3@gatech.edu> Co-authored-by: Daniel Vickers <danieljvickers@frontier01665.frontier.olcf.ornl.gov>
Description
Loop unwrapping to optimize AMD Flang GPU performance for WENO.
Type of change
Testing
Performance checking of the
examples/3D_performance_testcaseAI code reviews
Reviews are not triggered automatically. To request a review, comment on the PR:
@coderabbitai review— incremental review (new changes only)@coderabbitai full review— full review from scratch/review— Qodo review/improve— Qodo code suggestions@claude full review— Claude full review (also triggers on PR open/reopen/ready)claude-full-review— Claude full review via label